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Abstract — In  this  paper  we  describe  the  current  status  of  our 
ongoing  effort  to  optimize  the  efficiency  of  a  novel  application 
package  for  Nonhydrostatic  Unified  Model  of  the  Atmosphere 
(NUMA)  when  running  on  large  cluster  architectures.  The 
linear  solver  within  a  distributed  memory  paradigm  is  critical 
for  overall  model  efficiency.  The  goal  of  this  work-in-progress 
is  to  investigate  the  scalability  of  the  model  for  different 
solvers  and  determine  a  set  of  optimal  solvers  to  be  used  for 
different  situations.  We  will  describe  our  novel  approach  and 
demonstrate  its  scalability  on  a  variety  of  clusters  of  multicore 
node  clusters.  We  also  present  performance  statistics  to  explain 
the  scalability  behavior. 
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ability,  performance; 

I.  Introduction 

Current  limited-area,  or  mesoscale,  atmospheric  mod¬ 
els  require  modeling  nonhydro  static  effects,  while  next- 
generation  global  models  are  moving  toward  the  nonhy¬ 
drostatic  regime.  The  nonhydro  static  atmospheric  models, 
which  run  at  resolutions  finer  than  10  km,  possess  fast- 
moving  acoustic  and  gravity  waves.  These  fast  moving 
waves  require  a  stringent  CFL  condition  if  the  equations 
are  discretized  explicitly.  To  mitigate  this  problem,  semi- 
implicit  (SI)  time-integrators,  based  on  a  Schur  complement 
technique,  have  been  developed  for  our  Nonhydrostatic 
Unified  Model  for  the  Atmosphere  (NUMA).  SI  time- 
integrators,  which  discretize  the  linear,  fast  moving  waves 
implicitly,  while  discretize  the  slower  dynamical  processes 
explicitly,  require  a  linear  solve  at  each  time-step.  These 
time-integrators,  while  only  conditionally  stable,  require 
less  computational  effort  than  fully-implicit  time  integrators, 
which  are  unconditionally  stable.  However,  engineering  ef¬ 
ficient  semi-implicit  time-integrators  which  scale  to  tens  of 
thousands  of  cores  remains  a  significant  challenge. 

The  efficiency  of  this  linear  solve  within  a  distributed 
memory  paradigm  is  critical  for  overall  model  efficiency. 
Therefore,  we  analyze  the  performance  of  two  different 
semi-implicit  time-integrators  within  distributed  memory  ar¬ 
chitectures.  The  goal  of  our  work-in-progress  is  to  compare 
the  scalability  of  these  solvers,  thus  determining  the  optimal 
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time-integrator  as  a  function  of  both  the  test  problem  and 
the  parallel  architecture.  The  rest  of  the  paper  is  structured 
as  follows:  In  Section  2  we  describe  the  NUMA  application 
and  our  approach  to  parallelization.  In  Section  3  we  dis¬ 
cuss  in  more  detail  the  semi-implicit  time-integrators  which 
we  employ  and  how  they  effect  the  efficiency  of  NUMA 
model.  In  Section  4  we  describe  the  systems  and  test  cases 
used  in  our  evaluation.  Section  5  presents  presents  scaling 
results  and  corresponding  performance  statistics.  We  draw 
our  conclusions  in  Section  6  and  also  describe  our  future 
plans. 

II.  The  Nonhydrostatic  Unified  Model  of  the 
Atmosphere  (NUMA)  Model 

A.  Governing  Equations 

NUMA  utilizes  the  fully  compressible,  non-hydrostatic 
Euler  equations  in  non-conservative  form  in  a  global  setting. 
These  equations  have  previously  been  considered  within  a 
semi-implicit  framework  for  2D  flows  [2]  and  more  recently 
for  3D  flows  [5]  in  Cartesian  geometries.  In  the  present 
study,  we  consider  three-dimensional  flow  ( x-y-z )  subject 
to  gravitational  and  Coriolis  forces 
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where  the  unknown  variables  are  (//,  ur,  O'),  where  p'  = 
p— po  is  a  density  perturbation,  u  =  (u,  v,  w)  is  velocity,  and 
O’  =  6  —  60  is  potential  temperature  perturbation.  In  Eq.  (1), 
V  v  denotes  a  vertical  gradient  and  r  is  a  unit  radial  vector. 
The  reference  states  po  and  60  are  hydrostatic  and  time- 
independent.  Defining  a  solution  vector  q  =  (p',ur,6'), 
Eq.  (1)  is  written  in  condensed  form  as 
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where  the  source  term  S(q)  is  a  right-hand  side  which  is 
discretized  by  the  spectral  element  method. 


B.  Spectral  Element  Discretization 

The  spectral  element  method  [1]  decomposes  the  spatial 
domain  D  C  R?  into  Ne  disjoint  elements  £le  via 

Ne 

=  U  Me  (3) 

e=l 

where  we  choose  Qe  as  hexahedra,  which  provides  1)  simple 
grid  generation  and  2)  efficient  (fast)  evaluation  of  the 
necessary  differentiation  and  integration  operators.  Within 
each  element  De,  a  finite-dimensional  approximation  is 
formed  by  expanding  q(x,  t)  in  basis  functions  i/jj  (x) 

Mat 

qN  (x,  i)  =  £  qj  (t)ipj  (x)  (4) 

3  =  1 

where  M n  =  (N  +  l)3  is  the  number  of  nodes  per  element 
and  N  is  the  order  of  the  basis  functions.  The  discrete  solu¬ 
tion  qAr  is  assumed  continuous  across  inter-element  bound¬ 
aries.  Basis  functions  are  constructed  as  tensor  products  of 
Lagrange  polynomials  ipi  (x)  =  ha(£)  ®  hp(rj)  ®  ft7(C)., 
where  ha(£)  is  the  Lagrange  polynomial  associated  with 
the  Legendre-Gauss-Lobatto  (LGL)  points  &  and  (£,  77,  £) 
are  functions  of  the  physical  variable  x.  Approximating  the 
prognostic  vector  q(x,  t)  by  a  finite-dimensional  approxima¬ 
tion  q^v  in  Eq.  (4),  multiplying  by  a  basis  function  and 
integrating  over  the  domain  Q  yields  the  weak  form 

[  i>i-wrd£l=  [  tpiS  (qjv)  dQ,  (5) 

Jfl  (Jt  JO. 

where  we  have  replaced  the  index  i  with  I  to  emphasize 
that  Eq.  (5)  is  now  a  global  representation.  Applying  the 
Galerkin  expansion  given  by  Eq.  (4)  to  Eq.  (5)  yields  the 
matrix- vector  equation 

^  =  MfjS(qi)  (6) 

where  the  mass-matrix  Mjj  =  fQ  'ipi'ipj  dQ  is  diagonal  if 
the  interpolation  and  integration  points  are  co-located.  This 
approximation  is  valid  for  N  >  4  while  incurring  a  small 
error  in  integration  [4].  Denoting  the  right-hand  size  (RHS) 
of  Eq.  (6)  by  R/(qj ),  Eq.  (6)  is  expressed  as 
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dt 
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where  A^i  denotes  the  global  assembly,  or  direct  stiffness 
summation  (DSS)  operator  that  maps  local,  element-wise 
coordinates  i  to  global  coordinates  I.  Eq.  (7)  forms  the  core 
of  the  spectral  element  method,  allowing  local,  element-wise 
information  q\e ^  to  propagate  to  adjacent  elements  via  the 
DSS  operator. 


C.  Parallelization 

III.  Impact  of  Time  Interators  on  Model 
Efficiency 

The  fast-moving  acoustic  and  gravity  waves  in  Eq.  (1) 
make  explicit  time-integration  unfeasible  do  to  the  stringent 
CFL  restriction.  Therefore,  we  have  developed  both  3D  and 
ID  semi-implicit  time-integrators  [3]  which  allow  much 
larger  time-steps.  Both  time-integrators  utilize  a  psuedo- 
Helmholtz  decomposition,  which  reduces  the  poorly  con¬ 
ditioned  monolithic  5 Np  x  5 Np  system  matrix  to  a  well- 
conditioned  Np  x  Np  matrix. 

A.  3D  Semi-Implicit  Time -Integrator 

In  the  3D  Semi-Implicit  (3D-SI)  time-integrator,  both  the 
horizantally  and  vertically  propagating  acoustic  and  gravity 
waves  are  discretized  implicitly,  resulting  in  following  linear 
system  which  must  be  solved  at  each  time- step: 

7^3  dPu  =  Rtt  (8) 

where  %3D  =  H( D,DT)  is  a  linear,  pseudo-Helmholtz 
operator  consisting  of  gradient  D  and  divergence  Dr  op¬ 
erators  operating  on  the  discretized  pressure  Ptt  and  Rtt  is 
an  effective  source  term.  Eq.  (8)  is  solved  iteratively  using  a 
Krylov  sub-space  method,  which  requires  global  reductions. 

B.  ID  Semi-Implicit  Time -Integrator 

In  contrast,  the  ID  Semi-Implicit  (ID-SI)  time-integrator, 
only  the  vertically  propagating  acoustic  and  gravity  waves 
are  discretized  implicitly.  Hence,  the  global  Np  x  Np  prob¬ 
lem  is  reduced  into  Nh,  independent  linear  systems  of  size 
Ny  x  Ny  which  are  solved  at  each  time- step 

H1DP$  =R(iA<i<NH.  (9) 

where  Nh  is  the  number  of  horizantal  grid  points,  Ny  is 
the  number  of  vertical  grid  points,  and  Np  =  Nh  *  Ny. 
In  Eq.  (9),  U1D  =  U  (p1D,T>JD)  is  a  linear,  pseudo- 
Helmholtz  operator  consisting  of  vertical  gradient  D  and 
divergence  Dr  operators.  When  combined  with  a  horizantol 
domain-decomposition,  no  inter-processor  communication  is 
required.  In  typical  global  NWP  problems,  Ny  <C  Nh- 

IV.  Scalability  and  Performance  Performance 
Analysis 

In  this  section  we  present  timings  and  performance  anal¬ 
ysis  statistics  for  the  ID  and  3D  integrators. 

A.  Test  Systems 

We  conducted  scalability  experiments  of  the  following 
systems: 

•  The  Constellation  Linux  Cluster  Ranger  is  located  at 
the  Texas  Advanced  Computing  Center  (TACC).  The 
Ranger  system  consists  of  of  3,936  16-way  SMP  com¬ 
pute  nodes.  Each  nodes  has  4  2.3  GHz  AMD  Quadcore 
Opteron™providing  15,744  AMD  Opteron  processors 
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Figure  1.  Scalability  of  the  ID  Integrator 


Figure  2.  Scalability  of  the  3D  Integrator  expressed 


for  a  total  of  62,976  compute  cores,  123  TB  of  total 
memory  and  1.7  PB  of  raw  global  disk  space.  It  has 
a  theoretical  peak  performance  of  579  TFLOPS.  All 
Ranger  nodes  are  interconnected  using  InfiniBand  tech¬ 
nology  in  a  full-CLOS  topology  providing  a  lGB/sec 
point-to-point  bandwidth.  For  further  details  see  [6]  . 

•  The  Lonestar  Linux  Cluster  is  also  located  at  TACC. 
It  consists  of  1,888  12way  nodes,  with  two  6- 
Core  Intel  Xeon  Intel  Hexa-Core  64-bit  3. 3 3 GHz 
Westmere™processors  on  a  single  as  an  SMP  unit.  This 
provides  for  a  total  of  22,656  cores.  It  is  configured 
with  44  TB  of  total  memory  and  276TB  of  local 
disk  space.  The  peak  performance  is  302  TFLOPS. 
Nodes  are  interconnected  with  InfiniBand  technology 
in  a  fat-tree  topology  with  a  40Gbit/sec  point-to-point 
bandwidth.  For  further  details  see  [7]. 

•  The  Cray  XT5  system  Kraken  is  located  at  the  National 
Institute  for  Computational  Sciences  (NICS)  at  the 
University  of  Tennesee  in  Oakridge.  The  system  runs 
the  Cray  Linux  Environment  (CLE)  2.2.  It  consists  of 
12way  9408  compute  nodes,  each  node  containing  two 
2.6GHz  6  core  AMD  Opteron  Istambul™prcessors, 
providing  a  total  of  112,896  compute  cores.  The  nodes 
are  connected  by  the  Cray  SeaStar2+  router.  For  more 
information  see  [8]. 

B.  Scalability  Results  and  Analysis 

We  ran  our  test  case  for  about  1400  time  steps.  Figures 
1  and  2  show  the  scalability  of  the  ID  and  3D  integra¬ 
tors  expressed  as  decrease  in  elapsed  execution  time  with 
increasing  number  of  MPI  processes.  The  timings  indicate 
that  the  ID  Integrator  scales  well  on  all  systems  and  yields 
shorter  execution  time  than  the  3D  integrator. 

The  timings  indicate  that  the  3D  integrator  does  not  scale 
beyond  144  processes  on  all  systems.  Using  more  than  288 
MPI  processes  actually  yields  negative  scalability:  that  is 
an  increase  in  execution  time  when  increasing  the  number 
of  MPI  processes.  The  reason  for  the  poor  scalability  is 
the  excessive  time  spent  in  global  reductions,  which  is 
required  by  the  GMRES  iterative  solver.  We  employed  the 
TAU  [9]  performance  analysis  tool  on  the  Lonestar  system 


for  a  run  with  864  MPI  processes.  The  timing  statistics 
displayed  in  Figure  3  show,  that  77%  of  the  time  is  spent 
in  global  communication  caused  by  a  huge  number  of  calls 
to  MPI_Allreduce  short  message  size.  The  test  case  under 
consideration  is  therefore  a  good  candidate  for  the  usage 
of  the  ID  integrator,  yielding  better  scalability  as  well  as  a 
shorter  execution  time  than  the  3D  integrator. 

V.  Conclusions  and  Future  Work 

From  the  experiments  and  that  we  have  conducted  so  far 
we  conclude  that  our  implementation  of  the  3D  integrator 
excessive  MPI  communication  time  due  to  global  reductions, 
at  least  for  certain  sets  of  input  data.  In  our  future  work 
we  will  focus  on  issues:  We  will  investigate  possible  opti¬ 
mizations  for  the  3D  integrator,  including  the  usage  of  other 
Krylov  subspace  methods.  Second,  we  will  conduct  many 
more  experiments  on  different  types  of  input  data  sets.  Our 
goal  is  to  derive  characteristics  of  the  application  runtime 
behavior  in  order  to  choose  the  most  efficient  solver. 
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